153 research outputs found

    Generating a training corpus for OCR post-correction using encoder-decoder model

    In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of relatively clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
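    As a rough illustration only (not the authors' system; the model sizes, the character vocabulary and the training setup below are assumptions), a character-level biLSTM that reads a noisy OCR character sequence and predicts a corrected character at each position could be sketched in PyTorch as follows.

```python
import torch
import torch.nn as nn

class CharBiLSTMCorrector(nn.Module):
    """Illustrative character-level biLSTM: maps noisy OCR characters to
    corrected characters position by position (a sketch, not the paper's code)."""
    def __init__(self, n_chars: int, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, noisy_ids):                   # (batch, seq_len) character ids
        h, _ = self.bilstm(self.embed(noisy_ids))   # contextual states from both directions
        return self.out(h)                          # logits over the character vocabulary

# Toy training step: the target is the clean version of the noisy input sequence.
model = CharBiLSTMCorrector(n_chars=100)
noisy = torch.randint(0, 100, (8, 40))
clean = torch.randint(0, 100, (8, 40))
logits = model(noisy)
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), clean.reshape(-1))
loss.backward()
```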

    NLP Community Perspectives on Replicability.

    With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises of how the NLP community views the topic of replicability in general. Using a survey in which we involve members of the NLP community, we investigate how our community perceives this topic, its relevance and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations that successful reproducibility requires more than having access to code and data. Additionally, the results show that the topic has to be tackled from the authors', reviewers' and community's side.

    Proposal for an Extension of Traditional Named Entities: from Guidelines to Evaluation, an Overview

    Within the framework of the construction of a fact database, we defined guidelines to extract named entities, using a taxonomy based on an extension of the usual named entities definition. We thus defined new types of entities with broader coverage, including substantive-based expressions. These extended named entities are hierarchical (with types and components) and compositional (with recursive type inclusion and metonymy annotation). Human annotators used these guidelines to annotate a 1.3M word broadcast news corpus in French. This article presents the definition and novelty of extended named entity annotation guidelines, the human annotation of a global corpus and of a mini reference corpus, and the evaluation of annotations through the computation of inter-annotator agreement. Finally, we discuss our approach and the computed results, and outline further work.
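    The abstract mentions evaluating the annotation through inter-annotator agreement. One common agreement measure (the abstract does not say which one the authors compute) is Cohen's kappa for two annotators over the same items; a minimal sketch with invented labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items
    (one common agreement measure; shown here only as an example)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] / n * freq_b[label] / n for label in freq_a)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators assigning entity types to six tokens.
a = ["PER", "ORG", "O", "O", "LOC", "PER"]
b = ["PER", "O",   "O", "O", "LOC", "PER"]
print(round(cohens_kappa(a, b), 3))   # about 0.76
```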

    Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers

    This paper compares the reference annotation of structured named entities in two corpora with different origins and properties. It addresses two questions linked to such a comparison. On the one hand, what specific issues were raised by reusing the same annotation scheme on a corpus that differs from the first in terms of media and that predates it by more than a century? On the other hand, what contrasts were observed in the resulting annotations across the two corpora?

    Approches à base de fréquences pour la simplification lexicale

    Lexical simplification consists in replacing words or phrases by simpler equivalents. In this paper, we present three models for lexical simplification, based on different criteria that make one word simpler to read and understand than another. We tested different sizes of context around the considered word: no context, with a model based on term frequencies in a simplified-English corpus; a few words of context, with n-gram probabilities derived from Web data; and an extended context, with a model based on co-occurrence frequencies. Keywords: lexical simplification, lexical frequency, language model.
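    To illustrate the simplest of the three variants, the no-context frequency model, one can rank the candidate substitutes of a word by their frequency in a simplified-English corpus and keep the most frequent one. This is a sketch only: the corpus, candidate list and tie-breaking below are invented, not the paper's resources.

```python
from collections import Counter

def simplify_word(word, candidates, simple_corpus_tokens):
    """No-context frequency model (illustrative): among the candidate substitutes,
    return the one that is most frequent in a simplified-English corpus."""
    freq = Counter(simple_corpus_tokens)
    return max(candidates + [word], key=lambda w: freq[w])

# Toy data: 'purchase' is replaced by the more frequent 'buy'.
corpus = "people buy food and people buy clothes and people get things".split()
print(simplify_word("purchase", ["buy", "acquire", "get"], corpus))   # buy
```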

    A corpus for studying full answer justification

    Question answering (QA) systems aim at retrieving precise information from a large collection of documents. To be considered as reliable by users, a QA system must provide elements to evaluate the answer. This notion of answer justification can also be useful when developing a QA system, in order to give criteria for selecting correct answers. An answer justification can be found in a sentence, a passage made of several consecutive sentences, or several passages of a document or several documents. Thus, we are interested in pinpointing the set of information that allows verifying the correctness of the answer in a candidate passage, and the question elements that are missing in this passage. Moreover, the relevant information is often given in texts in a different form from the question form: anaphora, paraphrases, synonyms. In order to have a better idea of the importance of all the phenomena we underlined, and to provide enough examples at the QA developer's disposal to study them, we decided to build an annotated corpus.
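    A crude baseline for the idea of "question elements missing from a candidate passage" is plain lexical matching, far weaker than the phenomena the corpus targets (anaphora, paraphrases, synonyms). The function and the toy question/passage below are invented for illustration only.

```python
def question_coverage(question_terms, passage):
    """Split question elements into those found literally in the passage and
    those missing (simple lexical matching; the corpus described above is
    built precisely to study the harder, non-literal cases)."""
    passage_tokens = set(passage.lower().split())
    present = [t for t in question_terms if t.lower() in passage_tokens]
    missing = [t for t in question_terms if t.lower() not in passage_tokens]
    return present, missing

present, missing = question_coverage(
    ["Curie", "Nobel", "1903"],
    "Marie Curie shared the prize with Pierre Curie and Henri Becquerel.",
)
print(present)   # ['Curie']
print(missing)   # ['Nobel', '1903']
```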